TMDB is a movie database, the source of data for this assignment. The assignment focuses on analyzing and studying the data of 20th-century movies to show the trend in the movie market.
# the library needed
import csv
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
# !pip install squarify
import squarify
The API provides movie data provided by http://www.themoviedb.org/
The sample style of API: https://api.themoviedb.org/3/movie/550?api_key=798b9d46254c561c622bfa9ed97f64ba 
This API is free but requires a one-by-one data download. The key is: '798b9d46254c561c622bfa9ed97f64ba'.
key = '798b9d46254c561c622bfa9ed97f64ba'
Because it is necessary to get the data from the website one by one, the middle number part of the website link is truncated. Numbers will be added later.
# sample: https://api.themoviedb.org/3/movie/550?api_key=798b9d46254c561c622bfa9ed97f64ba&emsp
url_movie = 'https://api.themoviedb.org/3/movie/'
url_2= '?api_key=' + key
The obtained data is contained in JSON, then transformed into dictionary data. On request, the request is successful only if status_code = 200; otherwise, failure returns null.
def get_data(num):
r = requests.get(url_movie + num + url_2 )
if(r.status_code == 200):
content = json.loads(r.text)
return Appropriate_data(content)
return ''
Retain the fields that need to be analyzed.
# Required fields.
labels = ['budget','genres','original_language','popularity','production_countries','release_date','revenue','runtime','vote_average']
def Appropriate_data(content):
if 'success' in content.keys():
return ''
temp = {}
for i in content:
if (i == 'budget'):
temp[i] = str(content[i])
if (i == 'genres'):
n = ''
for j in content[i]:
for k in j:
if(k == 'name'):
n = n + j[k] +','
temp[i] = n
if (i == 'original_language'):
temp[i] = content[i]
if (i == 'popularity'):
temp[i] = content[i]
if (i == 'production_countries'):
n = ''
for j in content[i]:
for k in j:
if (k == 'name'):
n = n + j[k] + ','
temp[i] = n
if (i == 'release_date'):
temp[i] = content[i][0:4]
if (i == 'revenue'):
temp[i] = str(content[i])
if (i == 'runtime'):
temp[i] = content[i]
if (i == 'vote_average'):
temp[i] = content[i]
return temp
Data saved in CSV format. Collect valid data from the first 20,000 data items.
with open('csv_dct.csv', 'w',newline='') as f:
writer = csv.DictWriter(f, fieldnames=labels,)
writer.writeheader()
for i in range(20000):
dct_arr = [get_data(str(i))]
if(len(dct_arr[0]) != 0):
for elem in dct_arr:
writer.writerow(elem)
Outliers or missing values in the obtained data need to be processed further.
data = pd.read_csv('csv_dct.csv')
data.describe()
| budget | popularity | release_date | revenue | runtime | vote_average | |
|---|---|---|---|---|---|---|
| count | 1.209600e+04 | 12096.000000 | 12088.000000 | 1.209600e+04 | 12096.000000 | 12096.000000 |
| mean | 9.832034e+06 | 13.567830 | 1992.405113 | 2.713343e+07 | 99.175513 | 5.950118 |
| std | 2.382788e+07 | 22.203409 | 18.424728 | 8.982582e+07 | 29.909587 | 1.682774 |
| min | 0.000000e+00 | 0.600000 | 1888.000000 | 0.000000e+00 | 0.000000 | 0.000000 |
| 25% | 0.000000e+00 | 3.387000 | 1985.000000 | 0.000000e+00 | 89.000000 | 5.503750 |
| 50% | 0.000000e+00 | 9.084500 | 2000.000000 | 0.000000e+00 | 98.000000 | 6.300000 |
| 75% | 8.000000e+06 | 15.573500 | 2006.000000 | 1.277145e+07 | 110.000000 | 6.931000 |
| max | 3.800000e+08 | 596.230000 | 2018.000000 | 2.920357e+09 | 540.000000 | 10.000000 |
We can see that budget, revenue,vote_average, and runtime have a minimum value of 0. The values of these columns may be missing.
Due to a large amount of data, the missing values will be replaced by the average value.
data['budget'] = data['budget'].replace(0,data['budget'].mean())
data['revenue'] = data['revenue'].replace(0,data['revenue'].mean())
data['runtime'] = data['runtime'].replace(0,data['runtime'].mean())
data['vote_average'] = data['vote_average'].replace(0,data['vote_average'].mean())
data.describe()
| budget | popularity | release_date | revenue | runtime | vote_average | |
|---|---|---|---|---|---|---|
| count | 1.209600e+04 | 12096.000000 | 12088.000000 | 1.209600e+04 | 12096.000000 | 12096.000000 |
| mean | 1.570557e+07 | 13.567830 | 1992.405113 | 4.351533e+07 | 100.823518 | 6.220667 |
| std | 2.180626e+07 | 22.203409 | 18.424728 | 8.576570e+07 | 26.989108 | 1.071724 |
| min | 1.000000e+00 | 0.600000 | 1888.000000 | 1.000000e+00 | 1.000000 | 0.500000 |
| 25% | 9.832034e+06 | 3.387000 | 1985.000000 | 2.713343e+07 | 90.000000 | 5.700000 |
| 50% | 9.832034e+06 | 9.084500 | 2000.000000 | 2.713343e+07 | 98.000000 | 6.300000 |
| 75% | 9.832034e+06 | 15.573500 | 2006.000000 | 2.713343e+07 | 110.000000 | 6.931000 |
| max | 3.800000e+08 | 596.230000 | 2018.000000 | 2.920357e+09 | 540.000000 | 10.000000 |
Extract the data that will be used later.
def get_details(movie_data, movie_details):
dct = defaultdict(list)
count = 0
for i in movie_data:
count = count + 1
if (isinstance(i, list)):
for j in i:
if j != '':
dct[j].append(movie_details[count - 1])
for i in dct.keys():
dct[i] = pd.Series(np.array(dct[i]))
details = []
for i in dct.keys():
details.append(dct[i].mean())
return details
movie_type = data['genres'].str.split(',')
Popularity = get_details(movie_type,data['popularity'])
Runtime = get_details(movie_type,data['runtime'])
Vote_average = get_details(movie_type,data['vote_average'])
Budget = get_details(movie_type,data['budget'])
Revenue = get_details(movie_type,data['revenue'])
In this collection of 12,000 pieces of data spanning the 20th century, The distribution of movie genres can reflect which subjects are the most popular movies at the time.
movie_type = movie_type.apply(pd.Series)
movie_type = movie_type.apply(pd.value_counts)
movie_type = movie_type.unstack().dropna().reset_index()
movie_type.columns =['level_0','type','counts']
movie_type_m = movie_type.drop(['level_0'],axis=1).groupby('type').sum().sort_values(by=['counts'],ascending=False).reset_index()
types = [i for i in movie_type_m.type]
types = types[1:]
counts = [i for i in movie_type_m.counts]
counts = counts[1:]
colors = ['#98F5FF','#9999ff','#FF4040','#E3CF57', '#7FFF00','#FF9912','#FF7F24']
plot = squarify.plot(
sizes=counts,
color = colors,
label=types,
value=counts,
alpha = 0.6,
edgecolor = 'white',#
linewidth =3,
)
plt.rc('font', size=10)
plot.set_title('Types',fontdict = {'fontsize':20})
plt.axis('off')
plt.tick_params(top = 'off', right = 'off')
plt.show()
It can be seen that Comedy, Drama, Thriller, Action, Romance, and Crime movies are more popular, indicating that these subjects are more popular subjects, of which Drama movies account for the most.
plt.figure(figsize=(30,8))
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.dpi'] = 300
x = types
y = Popularity
plt.bar(x,y)
plt.xlabel('Types')
plt.ylabel('Popularity')
plt.title('Movie Type Popularity')
plt.show()
fig = plt.figure(figsize=(15,15))
plt.pie(Popularity,labels=types,autopct='%1.2f%%')
plt.title("Movie Type Popularity Pie chart")
plt.show()
It is worth noting that previous movies with a high number of genres are not very popular when it comes to popularity. On the contrary, the Family genre boasts the top spot, followed by Science Fiction and Animation. Crime genre films are abundant in number but low in popularity. People prefer family genre movies, which are related to life. Maybe crime genre movies are still not so morally popular.
plt.figure(figsize=(30,8))
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.dpi'] = 300
x = types
y = Runtime
plt.bar(x,y)
plt.xlabel('Types')
plt.ylabel('Runtime')
plt.title('Types And Runtime')
plt.show()
The vast majority of the films stay around 100 minutes. History and Tv movies may involve recording more details and reach 120 minutes or more. The shortest length movie is the family genre, less than 80 minutes.
plt.figure(figsize=(30,8))
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.dpi'] = 300
x = types
y = Vote_average
plt.bar(x,y)
plt.xlabel('Types')
plt.ylabel('Votes')
plt.title('Types And Votes')
plt.show()
Overall, most of the genre movies are rated around 6. The Documentary genre is slightly below the others.
movie_production = data.groupby('release_date')['original_language'].count()
plt.figure(figsize=(20,8),dpi=80)
x = movie_production.index.tolist()
y = movie_production
plt.plot(x,y,color= 'b')
plt.xlabel('Years')
plt.ylabel('Production')
plt.title('Years And Production')
plt.show()
The thematic part of the data is information on 20th-century cinema. Therefore, from 1900 to around 2000, the growth curve of the number of films over time can be seen. The more it goes to 2000, the more exponential growth is observed. Due to the lack of data afterward, the changes between 2000 and 2010 and beyond cannot be shown in the picture. However, according to experience, the number of movies in the 21st century will be higher.
movie_countries = data['production_countries'].str.split(',')
movie_countries = movie_countries.apply(pd.Series)
movie_countries = movie_countries.apply(pd.value_counts)
movie_countries = movie_countries.unstack().dropna().reset_index()
movie_countries.columns =['level_0','type','counts']
movie_countries_m = movie_countries.drop(['level_0'],axis=1).groupby('type').sum().sort_values(by=['counts'],ascending=False).reset_index()
counts2 = [i for i in movie_countries_m.counts if i > 100]
counts2 = counts2[1:]
types2 = [i for i in movie_countries_m.type]
types2 = types2[1:len(counts2)+1]
types2[0] = 'USA'
plt.figure(figsize=(30,8))
x = types2
y = counts2
plt.bar(x,y)
plt.xlabel('Countries')
plt.ylabel('Production')
plt.title('Countries And Production')
plt.show()
fig = plt.figure(figsize=(15,15))
explode = []
for i in y:
if ((i/(sum(y))>0.03)):
explode.append((i/(sum(y)*10)))
explode.append(0)
explode=explode[0:len(y)]
plt.pie(y,labels=x,autopct='%1.2f%%',explode = explode)
plt.title("Countries Pie chart")
plt.show()
As can be seen from the chart, the United States is the top movie producer, accounting for 47.28% of the total number of movies (which should also be related to the data included on the website). Germany, France, and UK follow this.
movie_language = data['original_language'].str.split()
movie_language = movie_language.apply(pd.Series)
movie_language = movie_language.apply(pd.value_counts)
movie_language = movie_language.unstack().dropna().reset_index()
movie_language.columns =['level_0','type','counts']
movie_language_m = movie_language.drop(['level_0'],axis=1).groupby('type').sum().sort_values(by=['counts'],ascending=False).reset_index()
counts3 = [i for i in movie_language_m.counts if i > 100]
types3 = [i for i in movie_language_m.type]
types3 = types3[0:len(counts3)]
plt.figure(figsize=(15,5))
x = types3
y = counts3
plt.bar(x,y)
plt.xlabel('Language')
plt.ylabel('Production')
plt.title('Language And Production')
plt.show()
fig = plt.figure(figsize=(15,15))
explode = [(i/(sum(y)*10)) for i in y ]
plt.pie(y,labels=x,autopct='%1.2f%%',explode = explode)
plt.title("Language Pie chart")
plt.show()
Since the U.S. is a major film-producing country, there is no doubt that the number of films in the English language genre will be the highest, at 73.93 percent, followed by German and French.
sns.displot(data['runtime'],bins=30,kde=True)
plt.show()
Most of the movies are around 100 minutes long. It shows a stepwise decrease towards the two poles.
plt.figure(figsize=(10,8))
plt.scatter(data['runtime'],data['popularity'])
plt.title('Runtime And Popularity',fontsize=15)
plt.xlabel('Runtime',fontsize=15)
plt.ylabel('Popularity',fontsize=15)
plt.show()
The vast majority of popular movies are around 100 minutes in length and between 100 and 200 minutes. The highest values appear in this interval.The longer the duration the movie and the shorter the duration of the movie are not as popular.
sns.displot(data['vote_average'],kde=True)
plt.show()
Most of the movies are rated in the range of 6-8. The largest number of movies with a rating of about 6. The number of movies with high and low scores is very rare, which is in line with the law of the market.
plt.figure(figsize=(10,8))
plt.scatter(data['runtime'],data['vote_average'])
plt.title('Runtime And Vote_average',fontsize=15)
plt.xlabel('Runtime',fontsize=15)
plt.ylabel('Vote_average',fontsize=15)
plt.show()
Most of the movies are rated in the range of 6-8. The largest number of movies with a rating of about 6. The number of high-rated movies is between 100 and 200 minutes long. With a runtime increase of more than two hundred minutes, the rating is not low.
plt.scatter(data['vote_average'],data['popularity'])
plt.title('Vote_average And Popularity',fontsize=15)
plt.xlabel('Vote_average', fontsize=20)
plt.ylabel('Popularity', fontsize=20)
plt.show()
Most popular movies have ratings between 6 and 9. However, there is no shortage of movies with high ratings but shallow popularity.
plt.style.use('ggplot')
plt.figure(figsize=(10,8))
re =[ i / 1000000 for i in data['revenue']]
plt.scatter(data['vote_average'],re)
plt.title('Vote_average And Revenue ',fontsize=15)
plt.xlabel('Vote_average',fontsize=15)
plt.ylabel('Revenue(Million)',fontsize=15)
plt.show()
Obviously, the higher the rating, the higher the movie tends to be at the box office. Nevertheless, there will still be some movies with higher ratings and fewer viewers.
# 5. 预算
# 5.2 预算和收入 两条线
movie = data.groupby('release_date')['revenue'].sum()
movie2 = data.groupby('release_date')['budget'].sum()
plt.figure(figsize=(20,8),dpi=80)
x = movie.index.tolist()
y = [i/1000000 for i in movie]
y2 =[i/1000000 for i in movie2]
plt.plot(x,y,color= 'r',label= 'Revenue') # 标签未加
plt.plot(x,y2,color= 'b',label= 'Budget')
plt.xlabel('Years')
plt.ylabel('Million')
plt.title('Revenue And Budget')
plt.legend()
plt.show()
Increasing the budget of a film can improve its quality of the film. This can be well reflected in the box office. Since the 20th century, budgets have been increasing, reaching a peak around the 21st century, which may have a lot to do with the development of video technology in the late 20th century. Due to the lack of data afterward, the second half showed a decline, which was not the case.
# 不同类型电影的预算
plt.figure(figsize=(30,8))
x = types
y = [i/1000000 for i in Budget]
plt.bar(x,y)
plt.xlabel('Types')
plt.ylabel('Budget(Million)')
plt.title('Types And Budget')
plt.show()
Adventure, Animation film budget is the highest, followed by Action. At that time, the cost of making this type of movie was very high due to technical constraints. Compared to the largest number of Drama, the Comedy genre does not have as high a budget.
plt.figure(figsize=(30,8))
x = types
y = [i/1000000 for i in Revenue]
plt.bar(x,y)
plt.xlabel('Types')
plt.ylabel('Revenue(Million)')
plt.title('Types And Revenue')
plt.show()
Adventure, Animation has high production costs and the highest box office revenues in this film category. Family and Science Fiction genre movies are not very good in the budget, but they are good in box office revenue. It may also be a wider audience. The crime genre is the lowest in terms of investment and box office revenue.
fig = plt.figure(figsize=(20,20))
sns.heatmap(data.corr(),vmax = 1, annot = True,linewidths = 0.5, cbar=False, cmap='YlGnBu',annot_kws={'fontsize':25})
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.title('Features',fontsize = 20)
plt.show()
Adventure, Animation has high production costs and the highest box office revenues in this film category. Family and Science Fiction genre movies are not very good in the budget, but they are good in box office revenue. It may also be a wider audience. The crime genre is the lowest in terms of investment and box office revenue.
About twelve thousand valid movie data were analyzed for this assignment, mainly distributed in the 20th century. The information was analyzed mainly in genre, duration, region, rating, budget, and box office.
1. The number of films shows a trend of increasing yearly, due to the data collection problem after 2005 began to decline. Moreover, the top 6 movie genres are Drama, Comedy, Thriller, Action, Romance, and Crime. Family, Animation, Science Fiction, Adventure, and Action are popular film genres.
2. The United States is the primary producer of movies, accounting for 47.28%, followed by Germany, France, and the United Kingdom. English language films accounted for 73.93 %, followed by German and French.
3. The most popular movies' length is between 90-120 minutes. Movies that are too long or too short are less popular. Longer movies tend to have higher ratings.
4. The ratings are mainly concentrated on 6-8 points. High movies tend to last longer. Movies with high ratings are not necessarily popular, and there may be good cold movies.
5. With a high budget investment, the quality of the movie will be higher, and the box office obtained will be higher. The two types of movies with high budgets and high box office are Adventure and Animation.
6. The correlation between budget, popularity, and box office is the strongest. The high box office is usually associated with high popularity and a high budget.
In the 21st century, the drama, comedy, and family genres have endured. Thanks to the constant experimentation and accumulation of the 20th century. In the 21st century, with the introduction of new technologies and higher budgets, the film industry has taken on a higher dimension. However, the future is full of all kinds of uncertainties. Newer types of films may be created, and newer film technologies may disrupt the existing model.